This project goal is to reproduce two figures from FiveThirtyEight’s Congress Today Is Older Than It’s Ever Been. Specifically, I plan to reproduce the 1) line plot about the median age of the U.S. Senate and U.S. House by Congress from 1919 to 2023; and 2) the stacked cumulative area plot showing share of members in Congress from each generation, 1919 to 2023.
In order to organize the data into the appropriate format for making figures, there are three R packages dplyr, stringr, and tidyr for preprocessing. To generate interactive figures as in the original post, two additional R packages are loaded (plotly and RColorBrewer)
library(dplyr)
library(stringr)
library(tidyr)
library(plotly)
library(RColorBrewer)
Only one dataset is needed to reproduce the two figures, which I have
downloaded and stored under the ‘data/’ folder. Since both figures use
‘year’ as the x-axis and the related column start_date
chooses date as the unit, I first create a new column by extracting the
year information from ‘start_date’.
setwd("~/Library/CloudStorage/OneDrive-HarvardUniversity/Documents/PHD/Course/2024winter/BST270/BST270_project_XX")
data = read.csv('data/data_aging_congress.csv') %>%
mutate(year = str_extract(start_date, '[0-9]{4}'))
The first figure shows the median age of the U.S. Senate and U.S.
House by Congress from 1919 to 2023. To aggregate data by year and
chamber and calculate the median age, i utilize the
group_by function to group members of Congress by their
chamber and the year in which they were seated. Then i use the
summarise function to calculate the median age within each
group. This result in 106 different age groups (53 years * 2
chambers).
data_1 = data %>% group_by(year, chamber) %>%
summarise(median_age = round(median(age_years), 1))
head(data_1)
## # A tibble: 6 × 3
## # Groups: year [3]
## year chamber median_age
## <chr> <chr> <dbl>
## 1 1919 House 49.7
## 2 1919 Senate 56.8
## 3 1921 House 50.6
## 4 1921 Senate 57.4
## 5 1923 House 51.3
## 6 1923 Senate 58.6
I then use plot_ly to generate the lineplot. The shape
is set as ‘hv’ and the color is defined by different chambers. The
figure is identical to the one on the post.
data_1 %>%
plot_ly(x = ~year, colors = "Set1") %>%
add_trace(y = ~median_age, color = ~chamber, type = 'scatter',
mode = 'line', line = list(shape = "hv"))
Figure 2 shows the share of members in Congress from each generation from 1919 to 2023. To derive the proportion, I first aggregate the data by year and count the total number of Congress members in each Congress.
data_tmp = data %>% count(year)
I then aggregate by year and generation and derive the number of each generation in each Congress.
data_tmp2 = data %>% count(year, generation)
By merging the two temporary datasets together aligned by
year, I can then derive the proportion of generation.
data_count = merge(data_tmp, data_tmp2, by = 'year') %>%
mutate(prop_gen = round(n.y / n.x * 100, 1))
In order to generate the stacked area plot in the way the post figure
has, I transform the long-format data_count into the wide
format, where each generation corresponds to a column. I manually
replace the NA value as 0.
data_count_wide = pivot_wider(data_count, id_cols = c('year'), names_from = 'generation', values_from = 'prop_gen')
data_count_wide[is.na(data_count_wide)] = 0
In the end, we use plot_ly and specify
stackgroup = 'one' and groupnorm = 'percent'
to create the stacked cumulative area plot. Generations are added to the
figure by adding the add_trace and specifying
y as the target generation. The resulting figure is
identical to the one shown in the post.
color = brewer.pal(n = 10, 'Paired')
fig <- plot_ly(data_count_wide, x = ~year, y = ~Progressive, name = 'Progressive', type = 'scatter', mode = 'none', stackgroup = 'one', groupnorm = 'percent', fillcolor = color[1])
fig <- fig %>% add_trace(y = ~Missionary, name = 'Missionary', fillcolor = color[2])
fig <- fig %>% add_trace(y = ~Lost, name = 'Lost', fillcolor = color[3])
fig <- fig %>% add_trace(y = ~Greatest, name = 'Greatest', fillcolor = color[4])
fig <- fig %>% add_trace(y = ~Silent, name = 'Silent', fillcolor = color[5])
fig <- fig %>% add_trace(y = ~Boomers, name = 'Boomers', fillcolor = color[6])
fig <- fig %>% add_trace(y = ~`Gen X`, name = 'Gen X', fillcolor = color[7])
fig <- fig %>% add_trace(y = ~Millennial, name = 'Millennial', fillcolor = color[8])
fig <- fig %>% add_trace(y = ~Gilded, name = 'Gilded', fillcolor = color[9])
fig <- fig %>% add_trace(y = ~`Gen Z`, name = 'Gen Z', fillcolor = color[10])
fig <- fig %>% layout(title = 'Share of members in Congress from each generation, 1919 to 2023',
xaxis = list(title = "Year",
showgrid = FALSE),
yaxis = list(title = "Proportion",
showgrid = TRUE,
ticksuffix = '%'))
fig